Boxplots#

This page contains instructions and documentation for creating plots used to visualize curve ensembles.

spaghetti_plot#

Plots a random selection of curves.

Parameters#

client (bigquery.Client): BigQuery client object.

table_name (str): BigQuery table name containing data in ‘dataset.table’ form.

reference_table (str): BigQuery table name containing reference table in ‘dataset.table’ form.

geo_level (str): The name of a column from the reference table. The geographical level used to determine what places are included.

geo_values (str or listlike or None): The geographies to be included. A value or subset of values from the geo_level column. If None, then all values will be included.

geo_column (str, optional): Name of column in original table containing geography identifier. Defaults to ‘basin_id’.

reference_column (str, optional): Name of column in the reference table containing the geography identifier that corresponds to geo_column. Defaults to ‘basin_id’.

value (str, optional): Name of column in the original table containing the value to be analyzed. Defaults to ‘value’.

n (int, optional): Number of curves to plot. Defaults to 25.
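
The effect of n can be pictured as drawing runs at random, without replacement, before plotting. The sketch below is only an illustration with NumPy and is not taken from the library; the array layout (one row per simulated run, one column per time step) and the helper name sample_curves are assumptions made for the example.

```python
import numpy as np

def sample_curves(curves, n=25, seed=None):
    """Return up to n randomly chosen rows (curves), drawn without replacement.

    `curves` is a hypothetical 2-D array: one row per run, one column per time step.
    """
    rng = np.random.default_rng(seed)
    n = min(n, curves.shape[0])  # cannot draw more curves than exist
    chosen = rng.choice(curves.shape[0], size=n, replace=False)
    return curves[np.sort(chosen)]

# Synthetic ensemble for illustration only: 200 runs, 50 time steps
ensemble = np.random.default_rng(0).random((200, 50))
subset = sample_curves(ensemble, n=25, seed=1)
```

Passing a seed makes the selection reproducible, which can help when comparing figures across sessions.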


Returns#

fig (plotly.graph_objects.Figure): Plotly Figure containing visualization.


Example#

import epidemic_intelligence as ei
from google.oauth2 import service_account
from google.cloud import bigquery

credentials = service_account.Credentials.from_service_account_file('../../../credentials.json') # use the path to your credentials
project = 'net-data-viz-handbook' # use your project name
# Initialize a GC client
client = bigquery.Client(credentials=credentials, project=project)

table_name = 'h1n1_R2.basins_prevalence_agg'
reference_table = 'reference.gleam-geo-map'
reference_column = 'basin_id' # name of a column in reference table
geo_column = 'basin_id' # name of a column in table corresponding to column in reference table
geo_level = 'basin_label' 
geo_values = 'Portland(US-ME)' 
value = 'Infectious_18_23'

sp_fig = ei.spaghetti_plot(
    client=client,
    table_name=table_name,
    reference_table=reference_table,
    geo_level=geo_level,
    geo_values=geo_values,
    geo_column=geo_column,
    reference_column=reference_column,
    value=value,
    n=100)

# finishing touches
sp_fig.update_layout(width=900, height=500, 
                     showlegend=True, 
                     font_family='PT Sans Narrow', 
                     title='Spaghetti Plot',)
sp_fig.show()

functional_boxplot#

A functional boxplot uses curve-based statistics that treat each entire curve as a single data point, rather than treating each observation in a curve separately. It always plots the median and interquartile range.

Parameters#

client (bigquery.Client): BigQuery client object.

table_name (str): BigQuery table name containing data in ‘dataset.table’ form.

reference_table (str): BigQuery table name containing reference table in ‘dataset.table’ form.

geo_level (str): The name of a column from the reference table. The geographical level used to determine what places are included.

geo_values (str or listlike or None): The geographies to be included. A value or subset of values from the geo_level column. If None, then all values will be included.

geo_column (str, optional): Name of column in original table containing geography identifier. Defaults to ‘basin_id’.

reference_column (str, optional): Name of column in the reference table containing the geography identifier that corresponds to geo_column. Defaults to ‘basin_id’.

value (str, optional): Name of column in the original table containing the value to be analyzed. Defaults to ‘value’.

num_clusters (int, optional): Number of clusters that curves will be broken into based on grouping_method. Defaults to 1. Note: raising num_clusters above one significantly increases runtime.

num_features (int, optional): Number of features the kmeans algorithm will use to group curves if num_clusters is greater than 1. Must be less than or equal to the number of run_ids in the table.

grouping_method (str, optional): Method used to group curves. Must be one of:

  • 'mse' (default): Fixed-time pairwise mean squared error between curves.

  • 'abc': Fixed-time pairwise area between curves. Also called mean absolute error.
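
The two grouping distances above can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the library's BigQuery implementation: curves are assumed to be rows of a 2-D array sampled at the same, evenly spaced time steps, and pairwise_distance is a hypothetical helper name.

```python
import numpy as np

def pairwise_distance(curves, method="mse"):
    """Pairwise fixed-time distance between the rows of `curves` (runs x time steps)."""
    # diff[i, j, t] is curve i minus curve j at time step t
    diff = curves[:, None, :] - curves[None, :, :]
    if method == "mse":
        return (diff ** 2).mean(axis=2)   # mean squared error
    if method == "abc":
        return np.abs(diff).mean(axis=2)  # mean absolute error (area between curves)
    raise ValueError("method must be 'mse' or 'abc'")

# Three toy curves over three time steps
curves = np.array([[0.0, 0.0, 0.0],
                   [1.0, 1.0, 1.0],
                   [0.0, 2.0, 4.0]])
mse = pairwise_distance(curves, "mse")
abc = pairwise_distance(curves, "abc")
```

Both results are symmetric matrices with zeros on the diagonal. Because 'mse' squares the differences, it penalizes large deviations at single time steps more heavily than 'abc' does.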

kmeans_table (str, optional): BigQuery table name containing clustering information in ‘dataset.table’ form. Used when kmeans has already been performed with delete_data=False. Allows function to skip costly kmeans algorithm.

centrality_method (str, optional): Method used to determine curve centrality within their group. Must be one of:

  • 'mse' (default): Summed fixed-time mean squared error between curves.

  • 'abc': Summed fixed-time pairwise area between curves. Also called mean absolute error.

  • 'mbd': Modified band depth. For more information, see Sun and Genton (2011).
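
As an illustration of the 'mbd' option, modified band depth with bands formed from pairs of curves can be computed directly: for each curve, average over all pairs the proportion of time steps at which the curve lies inside the band spanned by the pair. The sketch below assumes curves stored as rows of a NumPy array and at least two curves; modified_band_depth is a hypothetical helper, and the library's own computation may differ.

```python
from itertools import combinations

import numpy as np

def modified_band_depth(curves):
    """Modified band depth (bands from pairs of curves) for each row of `curves`."""
    n_runs = curves.shape[0]
    depth = np.zeros(n_runs)
    pairs = list(combinations(range(n_runs), 2))
    for i, j in pairs:
        lower = np.minimum(curves[i], curves[j])
        upper = np.maximum(curves[i], curves[j])
        # Proportion of time steps each curve spends inside this band
        inside = (curves >= lower) & (curves <= upper)
        depth += inside.mean(axis=1)
    return depth / len(pairs)

# Three flat toy curves at heights 0, 1, and 2
toy = np.array([0.0, 1.0, 2.0])[:, None] * np.ones((1, 4))
depths = modified_band_depth(toy)
most_central = int(np.argmax(depths))
```

A larger depth means a more central curve, so the deepest curve plays the role of the median. The loop visits every pair of curves, which is quadratic in the number of runs.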

threshold (float, optional): Number of interquartile ranges from the median within which a curve must lie to not be considered an outlier. Defaults to 1.5.

dataset (str or None, optional): Name of BigQuery dataset to store intermediate tables. If None, then random hash value will be used. Defaults to None.

delete_data (bool, optional): If True, then intermediate data tables will be deleted. Defaults to False.

overwrite (bool, optional): If True, then will not prompt for confirmation if overwriting an existing BigQuery dataset. Defaults to False.
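
To see how centrality_method and threshold interact, the sketch below follows the general functional boxplot construction of Sun and Genton (2011): rank curves by centrality, take the envelope of the most central half as the central region, and flag curves that leave that envelope inflated by threshold times its width. It uses summed pairwise MSE as the centrality score and assumes curves stored as rows of a NumPy array. Treat it as an assumption about the general technique rather than a description of the library's internals; functional_boxplot_summary is a hypothetical name.

```python
import numpy as np

def functional_boxplot_summary(curves, threshold=1.5):
    """Return (median curve, lower envelope, upper envelope, outlier flags)."""
    # Centrality score: summed pairwise MSE, smaller means more central
    diff = curves[:, None, :] - curves[None, :, :]
    score = (diff ** 2).mean(axis=2).sum(axis=1)
    order = np.argsort(score)  # most central curve first
    median = curves[order[0]]
    # Central region: envelope of the most central half of the curves
    half = int(np.ceil(curves.shape[0] / 2))
    central = curves[order[:half]]
    lower, upper = central.min(axis=0), central.max(axis=0)
    # Fences: central envelope inflated by `threshold` times its width
    width = upper - lower
    outside = (curves < lower - threshold * width) | (curves > upper + threshold * width)
    return median, lower, upper, outside.any(axis=1)

# Toy ensemble: four similar rising curves and one far-away curve
levels = np.array([0.0, 1.0, 2.0, 3.0, 100.0])
toy = levels[:, None] + np.arange(4.0)[None, :]
median, lower, upper, is_outlier = functional_boxplot_summary(toy)
```

A curve is flagged if it leaves the fences at any time step, so a single excursion is enough to mark the whole curve as an outlier.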


Returns#

fig (plotly.graph_objects.Figure): Plotly Figure containing visualization.


Example#

# required
table_name = 'h1n1_R2.basins_prevalence_agg'
reference_table = 'reference.gleam-geo-map'
reference_column = 'basin_id' # name of a column in reference table
geo_column = 'basin_id' # name of a column in table corresponding to column in reference table
geo_level = 'basin_label'
geo_values = 'Portland(US-ME)'
value = 'Infectious_18_23'

# Set parameters for grouping
num_clusters = 1
num_features = 20 
grouping_method = 'mse' # mean squared error
centrality_method = 'mse' # mean squared error

dataset = None
delete_data = True

fbp_fig = ei.functional_boxplot(
    client=client,
    table_name=table_name,
    reference_table=reference_table,
    geo_level=geo_level,
    geo_values=geo_values,
    geo_column=geo_column,
    reference_column=reference_column,
    value=value,
    num_clusters=num_clusters,
    num_features=num_features,
    grouping_method=grouping_method,
    centrality_method=centrality_method,
    dataset=dataset,
    delete_data=delete_data,
    overwrite=True
)

# finishing touches
fbp_fig.update_layout(width=900, height=500, 
                     showlegend=True, 
                     font_family='PT Sans Narrow', 
                     title='Functional Boxplot',
                     yaxis_title="Infectious 18-23yo"
)
fbp_fig.show()
Dataset `net-data-viz-handbook.d9049d69404de9710ad9d14e8a742048fd390617f1f5b40cfbe2ca03fe2f7db1` created.

fixed_time_boxplot#

A fixed-time boxplot uses fixed-time statistics that rank the points at each time step and use those ranks to construct confidence intervals for each time step. It always plots the median and interquartile range.

Parameters#

client (bigquery.Client): BigQuery client object.

table_name (str): BigQuery table name containing data in ‘dataset.table’ form.

reference_table (str): BigQuery table name containing reference table in ‘dataset.table’ form.

geo_level (str): The name of a column from the reference table. The geographical level used to determine what places are included.

geo_values (str or listlike or None): The geographies to be included. A value or subset of values from the geo_level column. If None, then all values will be included.

geo_column (str, optional): Name of column in original table containing geography identifier. Defaults to ‘basin_id’.

reference_column (str, optional): Name of column in the reference table containing the geography identifier that corresponds to geo_column. Defaults to ‘basin_id’.

value (str, optional): Name of column in the original table containing the value to be analyzed. Defaults to ‘value’.

num_clusters (int, optional): Number of clusters that curves will be broken into based on grouping_method. Defaults to 1. Note: raising num_clusters above one significantly increases runtime.

num_features (int, optional): Number of features the kmeans algorithm will use to group curves if num_clusters is greater than 1. Must be less than or equal to the number of run_ids in the table.

grouping_method (str, optional): Method used to group curves. Must be one of:

  • 'mse' (default): Fixed-time pairwise mean squared error between curves.

  • 'abc': Fixed-time pairwise area between curves. Also called mean absolute error.

kmeans_table (str, optional): BigQuery table name containing clustering information in ‘dataset.table’ form. Used when kmeans has already been performed with delete_data=False. Allows function to skip costly kmeans algorithm.

dataset (str or None, optional): Name of BigQuery dataset to store intermediate tables. If None, then random hash value will be used. Defaults to None.

delete_data (bool, optional): If True, then intermediate data tables will be deleted. Defaults to False.

overwrite (bool, optional): If True, then will not prompt for confirmation if overwriting an existing BigQuery dataset. Defaults to False.

confidence (float, optional): From 0 to 1. Confidence level of interval that will be graphed. Also determines which points are considered outliers.

full_range (bool, optional): If True, then mesh will be drawn around entire envelope, including outliers. Defaults to False.

outlying_points (bool, optional): If True, then outlying points will be graphed. Defaults to True.
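
The role of confidence can be illustrated with a small NumPy sketch: at each time step, take the central interval that holds the requested share of runs and treat points outside it as outlying. This is a sketch under assumptions (curves stored as rows of an array, linear-interpolation quantiles, hypothetical helper name fixed_time_interval), not the library's BigQuery query.

```python
import numpy as np

def fixed_time_interval(curves, confidence=0.9):
    """Per-time-step central interval, median, and mask of outlying points."""
    tail = (1 - confidence) / 2  # probability left in each tail
    lower = np.quantile(curves, tail, axis=0)
    median = np.quantile(curves, 0.5, axis=0)
    upper = np.quantile(curves, 1 - tail, axis=0)
    # A point is outlying if it falls outside the interval at its own time step
    outlying = (curves < lower) | (curves > upper)
    return lower, median, upper, outlying

# Toy ensemble: 101 flat curves at heights 0..100, three time steps
toy = np.arange(101.0)[:, None] * np.ones((1, 3))
lower, median, upper, outlying = fixed_time_interval(toy, confidence=0.5)
```

Unlike the functional boxplot, the ranking here is redone at every time step, so a single run can sit inside the interval at one time step and outside it at another.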


Returns#

fig (plotly.graph_objects.Figure): Plotly Figure containing visualization.


Example#

# required
table_name = 'h1n1_R2.basins_prevalence_agg'
reference_table = 'reference.gleam-geo-map'
reference_column = 'basin_id' # name of a column in reference table
geo_column = 'basin_id' # name of a column in table corresponding to column in reference table
geo_level = 'basin_label'
geo_values = 'Portland(US-ME)'
value = 'Infectious_18_23'

# Set parameters for grouping
num_clusters = 1
num_features = 20 
grouping_method = 'mse' # mean squared error
confidence = .95

dataset = None
delete_data = True

ft_fig = ei.fixed_time_boxplot(
    client,
    table_name,
    reference_table,
    geo_level,
    geo_values,
    geo_column=geo_column,
    reference_column=reference_column,
    num_clusters=num_clusters,
    num_features=num_features,
    grouping_method=grouping_method,
    value=value,
    dataset=dataset,
    delete_data=delete_data,
    kmeans_table=False,
    confidence=confidence,
    full_range=True,
    outlying_points=False,
)

# finishing touches
ft_fig.update_layout(width=900, height=500, 
                     showlegend=True, 
                     font_family='PT Sans Narrow', 
                     title='Traditional Boxplot',)
ft_fig.show()
Dataset `net-data-viz-handbook.d9ca9c3939e3ac94d2ea3763e59a5bd13a7c5c7f2ebfad90cd6dcab13f2a499b` created.
BigQuery dataset `net-data-viz-handbook.d9ca9c3939e3ac94d2ea3763e59a5bd13a7c5c7f2ebfad90cd6dcab13f2a499b` removed successfully, or it did not exist.

fetch_fixed_time_quantiles#

Allows calculation of custom fixed-time quantiles. Always fetches median.

Parameters#

client (bigquery.Client): BigQuery client object.

table_name (str): BigQuery table name containing data in ‘dataset.table’ form.

reference_table (str): BigQuery table name containing reference table in ‘dataset.table’ form.

confidences (list of float): List of confidences to gather, from 0 to 1. For example, entering .5 will result in the 25th and 75th percentiles being calculated.

geo_level (str): The name of a column from the reference table. The geographical level used to determine what places are included.

geo_values (str or listlike or None): The geographies to be included. A value or subset of values from the geo_level column. If None, then all values will be included.

geo_column (str, optional): Name of column in original table containing geography identifier. Defaults to ‘basin_id’.

reference_column (str, optional): Name of column in the reference table containing the geography identifier that corresponds to geo_column. Defaults to ‘basin_id’.

value (str, optional): Name of column in the original table containing the value to be analyzed. Defaults to ‘value’.

num_clusters (int, optional): Number of clusters that curves will be broken into based on grouping_method. Defaults to 1. Note: raising num_clusters above one significantly increases runtime.

num_features (int, optional): Number of features the kmeans algorithm will use to group curves if num_clusters is greater than 1. Must be less than or equal to the number of run_ids in the table.

grouping_method (str, optional): Method used to group curves. Must be one of:

  • 'mse' (default): Fixed-time pairwise mean squared error between curves.

  • 'abc': Fixed-time pairwise area between curves. Also called mean absolute error.

kmeans_table (str, optional): BigQuery table name containing clustering information in ‘dataset.table’ form. Used when kmeans has already been performed with delete_data=False. Allows function to skip costly kmeans algorithm.

dataset (str or None, optional): Name of BigQuery dataset to store intermediate tables. If None, then random hash value will be used. Defaults to None.

delete_data (bool, optional): If True, then intermediate data tables will be deleted. Defaults to False.

overwrite (bool, optional): If True, then will not prompt for confirmation if overwriting an existing BigQuery dataset. Defaults to False.


Returns#

df (pandas.DataFrame): pandas dataframe containing quantiles and median.
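
The quantile levels implied by confidences follow from splitting the leftover probability evenly between the two tails, as in the 25th and 75th percentile example given for the confidences parameter. The helper below is a hypothetical illustration of that mapping only; it does not reproduce the column names or layout of the returned DataFrame.

```python
def confidences_to_quantiles(confidences):
    """Map confidence levels to the sorted quantile levels they imply, plus the median."""
    levels = {0.5}  # the median is always included
    for c in confidences:
        if not 0 <= c <= 1:
            raise ValueError("confidences must lie between 0 and 1")
        # Round to suppress floating-point noise such as 0.04999999999999999
        levels.add(round((1 - c) / 2, 10))
        levels.add(round((1 + c) / 2, 10))
    return sorted(levels)

levels = confidences_to_quantiles([0.9, 0.5])
```

For the confidences used in the example on this page, [.9, .5], this gives the 5th, 25th, 50th, 75th, and 95th percentiles.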


Example#

# uses the same parameters as fixed_time_boxplot!
df_ft = ei.boxplots.fetch_fixed_time_quantiles(
    client=client,
    table_name=table_name,
    reference_table=reference_table,
    confidences=[.9, .5], # just introduce the confidences parameter
    geo_level=geo_level,
    geo_values=geo_values,
    geo_column=geo_column,
    reference_column=reference_column,
    num_clusters=num_clusters,
    num_features=num_features,
    grouping_method=grouping_method,
    value=value,
    dataset=dataset,
    delete_data=delete_data,
    kmeans_table=False,
)

df_ft
Dataset `net-data-viz-handbook.65dccebd9c110905d6de3a5101c56f430697943bc506cef33144c40f307e6540` created.